Download the template notebook here.

1 Why dataviz?

If you don’t know the shape of your data, you might not draw appropriate conclusions about the statistical tests you perform.

Check out the Datasaurus Dozen (a modern descendant of Anscombe’s quartet): thirteen datasets that all share the same x/y means and standard deviations, yet look completely different when plotted. Summary statistics alone can hide the shape of your data:

Dataviz is also engaging, communicative, creative, efficient, and fun.

1.1 Base R plots

Base R has some ways to create plots quickly, and it’s worth knowing a little about them before we go further.

# plot()
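A minimal sketch of what that chunk might contain, assuming the palmerpenguins package is installed so the `penguins` dataset (used throughout this page) is available:

```r
library(palmerpenguins)

# plot() guesses a sensible display from whatever you hand it;
# two numeric vectors produce a scatter plot
plot(penguins$flipper_length_mm, penguins$body_mass_g)
```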

We will talk more about what to do with these plots later, but I want to show you a Quantile-Quantile (QQ) plot now so you can see how easy it is to make, even before we discuss why you’d want one.

#qqplot()

#qqnorm()
#qqline()
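As a hedged sketch (again assuming palmerpenguins is installed), the three functions might be used like this:

```r
library(palmerpenguins)

# qqnorm() compares the sample quantiles of a variable
# to the quantiles of a normal distribution
qqnorm(penguins$body_mass_g)
qqline(penguins$body_mass_g)  # add a reference line through the quartiles

# qqplot() compares the quantiles of two samples directly
qqplot(penguins$flipper_length_mm, penguins$body_mass_g)
```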

2 Visual grammar

Using the tidyverse, we can create beautiful, engaging, informative plots. To do so, we will build up the plot in layers. This might seem to be a bit of a faff at first, but it ends up being powerful (and easy once you know how the components work).

# base plot

This is the background of the plot – a check to see that penguins is something that can be plotted from.

To start adding things to the plot, we need to specify what we want the plot to extract from penguins.

# add aesthetics

Now the plot knows a bit more about what we’re asking, but not enough to display the data. This is how ggplot() differs from plain plot(). By itself, plot() infers how we want to display our data. This is great when it’s correct and not great when it’s wrong (which it often is, without additional specifications). In contrast, ggplot() requires the specifications from the start, but they’re integrated more smoothly.

# simple scatter plot
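One way those three chunks might build up, as a sketch (assuming the tidyverse and palmerpenguins are loaded; the column choices are mine):

```r
library(tidyverse)
library(palmerpenguins)

# 1. base layer: the data alone; nothing is drawn yet
ggplot(penguins)

# 2. add aesthetics: map columns to plot dimensions
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g))

# 3. add a geometry: now the points appear
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()
```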

You can get rid of the message at the top (which we don’t care about) by adding , warning = FALSE after the code chunk identifier:

{r, eval=FALSE, warning = FALSE}

You can do a lot of other neat stuff here, but you’ll have to look that up on your own time.

Up to this point, we’ve done exactly what the simple plot() command can do. Now, we want to go beyond.

# colour species differently
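With the layered grammar, colouring by group is one extra aesthetic; a sketch under the same assumptions as above:

```r
library(tidyverse)
library(palmerpenguins)

# mapping species to colour splits the points into three groups
# and adds a legend automatically
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g,
                     colour = species)) +
  geom_point()
```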

To replicate this plot in base R with plot(), we’d need to manually subset and plot each species separately, which is a pain, doesn’t include all the same options, and yet takes a LOT more code (plus it still doesn’t look as nice, in my opinion):

# plot() is part of base R
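To see what that pain looks like, here is one way the base R version might go (the colour choices and labels are mine; assuming palmerpenguins is installed):

```r
library(palmerpenguins)

# subset each species by hand
adelie    <- penguins[penguins$species == "Adelie", ]
chinstrap <- penguins[penguins$species == "Chinstrap", ]
gentoo    <- penguins[penguins$species == "Gentoo", ]

# plot the first group, fixing the axis ranges so the others fit
plot(adelie$flipper_length_mm, adelie$body_mass_g,
     col = "red",
     xlim = range(penguins$flipper_length_mm, na.rm = TRUE),
     ylim = range(penguins$body_mass_g, na.rm = TRUE),
     xlab = "flipper length (mm)", ylab = "body mass (g)")

# overlay the other groups point by point
points(chinstrap$flipper_length_mm, chinstrap$body_mass_g, col = "green")
points(gentoo$flipper_length_mm, gentoo$body_mass_g, col = "blue")

# and build the legend manually, too
legend("topleft", legend = c("Adelie", "Chinstrap", "Gentoo"),
       col = c("red", "green", "blue"), pch = 1)
```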

So from now on, we’ll be working with ggplot() only.

One more thing: it’s important to make your plots visually accessible to a broad audience. The default ggplot() has a grey background, which ends up causing problems more often than a white background would, so we’ll also always add the theme_bw() layer to our plots from here on out. It’s optional but recommended.
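Adding the theme is just one more layer; continuing the sketch from above:

```r
library(tidyverse)
library(palmerpenguins)

ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g,
                     colour = species)) +
  geom_point() +
  theme_bw()  # white background with grey gridlines
```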

2.1 Types of plots

Let’s take a look at what’s needed to make a histogram or density plot.

# think about when it's appropriate to use histograms vs density plots
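A hedged sketch of both (the variable and bin count are my choices): a histogram shows raw counts per bin, while a density plot shows a smoothed estimate of the distribution’s shape, which makes overlapping groups easier to compare.

```r
library(tidyverse)
library(palmerpenguins)

# histogram: counts of observations falling in each bin
ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(bins = 30) +
  theme_bw()

# density: a smoothed estimate, here with one semi-transparent
# curve per species so the overlap stays visible
ggplot(penguins, aes(x = body_mass_g, fill = species)) +
  geom_density(alpha = 0.5) +
  theme_bw()
```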

Another common and useful type of plot is the box and whisker plot.

# box plots are easy and demonstrate how crossed designs are easy to plot
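For example, crossing species with sex takes only two aesthetics (a sketch; I drop the rows where sex is missing so there’s no stray NA box):

```r
library(tidyverse)
library(palmerpenguins)

# one box per species, split into adjacent boxes by sex:
# a crossed design in a single line of aesthetics
penguins %>%
  drop_na(sex) %>%
  ggplot(aes(x = species, y = body_mass_g, fill = sex)) +
  geom_boxplot() +
  theme_bw()
```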

Bar plots are sometimes controversial, but they can also be very useful. They take slightly different arguments than other types of plots because of how the bar height is ‘calculated’.

# bar plots can require some extra prep work
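The distinction, sketched (variable choices are mine): geom_bar() computes the bar heights for you by counting rows, whereas geom_col() expects you to have computed the heights yourself, which often means summarising first.

```r
library(tidyverse)
library(palmerpenguins)

# geom_bar(): heights are row counts (stat = "count" is the default)
ggplot(penguins, aes(x = species)) +
  geom_bar() +
  theme_bw()

# geom_col(): heights are values you prepared in advance
penguins %>%
  group_by(species) %>%
  summarise(mean_mass = mean(body_mass_g, na.rm = TRUE)) %>%
  ggplot(aes(x = species, y = mean_mass)) +
  geom_col() +
  theme_bw()
```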

You may also want to explore fancier types of plots, or combine types we’ve already encountered. This is easy with ggplot()’s modular construction and visual grammar.

# overlay violin, box, and points
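One possible combination (widths and transparency are arbitrary choices of mine): each geom is just another layer, drawn in the order you add it.

```r
library(tidyverse)
library(palmerpenguins)

ggplot(penguins, aes(x = species, y = body_mass_g)) +
  geom_violin() +                          # overall distribution shape
  geom_boxplot(width = 0.15) +             # median and quartiles on top
  geom_jitter(width = 0.1, alpha = 0.3) +  # the raw observations
  theme_bw()
```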

3 Visual analysis

Statistics is (like) dangerous dark magic: if you know what you’re doing, it means you’ve dedicated your life (soul) to it and have no time or capacity to do other things. If you don’t know what you’re doing, it can hurt you or people around you. If you’re somewhere in the middle, it is best to go slow, hedge your bets, and use it judiciously.

Why is statistics something to be wary of?

  • It’s unintuitive. Human brains were not designed to understand proper statistics.
  • Human brains are too good at detecting patterns – even if none exist.
  • Human brains like attributing causality to things regardless of underlying mechanisms.
  • Developing your expertise in your field or subfield of choice is a major undertaking, and statistics is an entirely independent area to develop expertise in as well.
  • Be honest with yourself: how comfortable are you teaching yourself complex maths?
  • Statistics is a tool for:
    1. telling us what we already know
    2. telling us that our squishy human brains are only human (and wrong about what we think we know)

3.1 Continuous data

Let’s create our own toy dataset:

set.seed(18) # make the random draws reproducible
x <- rnorm(50) # 50 random numbers from a normal distribution
y <- 2 * x + 5 # for each x, multiply by 2 (slope) and add 5 (intercept)

Here is what these data look like. Too perfect: every point falls exactly on a line, even though x was drawn at random:

# create a table with two columns
# establish the base of a plot
# use points to plot the data
# use a nice theme
# add a red line with the specified slope
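Following the comments above, that chunk might look like this (a sketch, assuming the tidyverse is loaded):

```r
library(tidyverse)

set.seed(18)
x <- rnorm(50)
y <- 2 * x + 5

tibble(x, y) %>%                 # a table with two columns
  ggplot(aes(x = x, y = y)) +    # the base of the plot
  geom_point() +                 # draw the data as points
  theme_bw() +                   # a nice theme
  geom_abline(slope = 2, intercept = 5, colour = "red")  # the line we built in
```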

Let’s add more noise, like any complex system would have:

e <- rnorm(50) # random noise

# realistic model
y2 <- 2 * x + 5 + e # slope = 2, intercept = 5, random noise ("error" or epsilon) = e

tbl1 <- tibble(x, y2) # combine into a dataset

How does the new noise change the data?

# using this dataset
# create a base plot with x on the x-axis and y2 on the y axis
# make it pretty
# make it a scatter plot (points)
# add the red line to indicate intended slope and intercept
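Sketching that chunk out, following the comments (I recreate the toy data so the block runs on its own):

```r
library(tidyverse)

set.seed(18)
x <- rnorm(50)
e <- rnorm(50)          # random noise
y2 <- 2 * x + 5 + e
tbl1 <- tibble(x, y2)

tbl1 %>%
  ggplot(aes(x = x, y = y2)) +   # base plot: x on the x-axis, y2 on the y-axis
  theme_bw() +                   # make it pretty
  geom_point() +                 # scatter plot
  geom_abline(slope = 2, intercept = 5, colour = "red")  # intended slope/intercept
```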

Is the red line still the best way to approximate this data?

model1 <- lm(y2 ~ x) # calculate slope and intercept automatically

What are the calculated slope and intercept?

coef(model1) # `coef` stands for coefficients

Plot the data with the intended shape of the data (red) and the calculated shape (green):

# using the toy dataset
# establish the base of the plot
# make it pretty
# draw the data as points
# add a vertical blue line at the "intercept" (y axis)
# add the intended shape of the data (red)
# add the calculated shape (green)
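Putting the comments above together, one possible version of that chunk (the toy data and model are recreated so the block stands alone):

```r
library(tidyverse)

set.seed(18)
x <- rnorm(50)
e <- rnorm(50)
y2 <- 2 * x + 5 + e
tbl1 <- tibble(x, y2)
model1 <- lm(y2 ~ x)

tbl1 %>%
  ggplot(aes(x = x, y = y2)) +                    # base of the plot
  theme_bw() +                                    # make it pretty
  geom_point() +                                  # the data as points
  geom_vline(xintercept = 0, colour = "blue") +   # the y axis, where the intercept lives
  geom_abline(slope = 2, intercept = 5,
              colour = "red") +                   # intended shape of the data
  geom_abline(slope = coef(model1)[2],
              intercept = coef(model1)[1],
              colour = "green")                   # calculated shape
```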

How do the intended shape and calculated shape differ? Why?

We can actually extract these numbers (and more!) in a fancy-looking output summary:

summary(model1)

Moreover, there’s a way to do this calculation directly within a plot:

# geom_smooth() calculates the simple linear regression within the ggplot() environment
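A sketch of that chunk (toy data recreated so it runs on its own): geom_smooth(method = "lm") fits the same simple linear regression and draws the fitted line, with a confidence ribbon around it.

```r
library(tidyverse)

set.seed(18)
x <- rnorm(50)
e <- rnorm(50)
y2 <- 2 * x + 5 + e
tbl1 <- tibble(x, y2)

tbl1 %>%
  ggplot(aes(x = x, y = y2)) +
  geom_point() +
  theme_bw() +
  geom_smooth(method = "lm")  # fit and draw the regression line inside the plot
```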

4 Workshop activities

simdat <- read.csv("../data/simulated-data.csv", header = TRUE)

Using the simulated data set from last time, make and interpret the following plots:

  1. For Region 3 (Verb) only, inspect the interaction of age with the conditions created by crossing the Frequency factor (freq) with the Grammaticality factor (gram).
    • Since age is numeric and (ostensibly) continuous, try using geom_smooth(), among other options.
    • You can define the colour and linetype for the two factors if you wish. fill is also available.
      • What other aesthetics could you use?
      • Do they clarify or confuse the graph?
    • How would you interpret these visual results?
    • What properties of the graph are visually useful or not?
    • Try to modify the graph so it is clear and easy to interpret.
  2. Use a box and whisker plot (geom_boxplot()) to plot each of the five regions’ reaction times, illustrating the four conditions.
    • A useful tool for this is facet_grid() or facet_wrap(). Look them up and learn how to use them.
    • As before, you can specify different aesthetics for the two factors so that they are plotted separately but adjacent.
    • Try plotting one as the x-axis aesthetic and the other as the fill aesthetic, for example.
    • What other ways might you do this? What works best for you?
    • Can this plot help you interpret the results?
  3. Ordinal data can be very tricky to interpret visually and statistically.
    • Ensure the dataset you plot does not have artificially inflated power due to multiple regions.
    • Group and summarise the data in order to pipe a simplified table into ggplot().
    • Plot a stacked bar chart where each bar is the same height and each rating level is a different colour.
    • On the x-axis, find a way to separate the four conditions made by crossing the two factors.
      • You may consider learning how to use interaction() or you may want to mutate() a column in advance.
      • There are many different ways to do this. Investigate some other ways, too.
    • What can you decipher in this graph?
    • Do you think there are any differences in how conditions were rated?

4.1 Learn from someone else’s code

Go through the following code line by line. Using an internet search engine, the R documentation Help window, and selectively changing or commenting out code, identify what each line does. Take notes by adding comments (# like this) after each line.

simdat %>% 
  mutate(region = as.factor(region)) %>% 
  group_by(freq, gram, region) %>% 
  summarise(mean.rt = mean(rt),
            se.rt = sd(rt)/sqrt(n())) %>% 
  ggplot(aes(x = region, 
             y = mean.rt, 
             group = interaction(freq, gram),
             colour = gram, 
             linetype = freq)) +
  theme_bw() +
  geom_point() +
  geom_path() +
  geom_errorbar(aes(ymin = mean.rt - se.rt, 
                    ymax = mean.rt + se.rt), 
                width=.2) +
  scale_x_discrete(labels = c("the", "old", "VERB", "the", "boat")) +
  scale_color_manual(values = c("grey20", "grey60")) +
  ylab("reaction time (ms)") +
  xlab("region of interest") +
  ggtitle("Self-paced reading time across all regions",
          subtitle = "Shaded areas indicate significant main effects") +
  annotate(geom="rect",
           xmin = 2.6, 
           xmax = 3.4, 
           ymin = 350, 
           ymax = 430, 
           alpha = .2) +
  annotate(geom="rect",
           xmin = 4.6, 
           xmax = 5.4, 
           ymin = 390, 
           ymax = 470, 
           alpha = .2) +
  annotate(geom="text",
           x=3, 
           y=427, 
           label="*", 
           size=10) +
  annotate(geom="text",
           x=5, 
           y=467, 
           label="*", 
           size=10) +
  NULL